Improving the performance of dictionary-based approaches in protein name recognition

Authors

  • Yoshimasa Tsuruoka
  • Jun'ichi Tsujii
Abstract

Dictionary-based protein name recognition is often a first step in extracting information from biomedical documents because it can provide ID information on recognized terms. However, dictionary-based approaches present two fundamental difficulties: (1) false recognition, mainly caused by short names; and (2) low recall due to spelling variations. In this paper, we tackle the former problem by using machine learning to filter out false positives, and we present two alternative methods for alleviating the latter problem of spelling variations. The first uses approximate string searching, and the second expands the dictionary with a probabilistic variant generator, which we propose in this paper. Experimental results on the GENIA corpus showed that filtering with a naive Bayes classifier greatly improved precision with only a slight loss of recall, resulting in a 10.8% improvement in F-measure, and that dictionary expansion with the variant generator gave a further 1.6% improvement, achieving an F-measure of 66.6%.
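
As an illustration of the approximate string searching idea mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a unit-cost edit distance, a fixed distance threshold, and a brute-force scan over token spans; the function names, the toy dictionary, and its IDs are all hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def find_candidates(tokens, dictionary, max_dist=1, max_len=3):
    """Return (start, end, text, entry_id) for every span of up to max_len
    tokens whose lowercased text is within max_dist edits of a dictionary name."""
    hits = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            text = " ".join(tokens[start:end])
            for name, entry_id in dictionary.items():
                if edit_distance(text.lower(), name.lower()) <= max_dist:
                    hits.append((start, end, text, entry_id))
    return hits


# Hypothetical toy dictionary: the names and IDs are made up for illustration.
toy_dictionary = {"NF-kappa B": "ID_0001", "interleukin 2": "ID_0002"}
tokens = "activation of NF-kappaB and interleukin-2 expression".split()
hits = find_candidates(tokens, toy_dictionary)
for hit in hits:
    print(hit)
```

Both spelling variants in the example sentence differ from their dictionary entries by a single edit, so they are found even though an exact lookup would miss them. With very short dictionary names, the same tolerance would match many unrelated spans, which is the false-positive problem the abstract addresses with a separate filtering step.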

Similar resources

A New Method for Speech Enhancement Based on Incoherent Model Learning in Wavelet Transform Domain

The quality of a speech signal degrades significantly in the presence of environmental noise, which impairs the performance of hearing aid devices, automatic speech recognition systems, and mobile phones. This paper considers single-channel enhancement of speech signals corrupted by additive noise. A dictionary-based algorithm is proposed to train the speech...

A New IRIS Segmentation Method Based on Sparse Representation

Iris recognition is one of the most reliable methods for identification. In general, it consists of image acquisition, iris segmentation, feature extraction, and matching. Among them, iris segmentation plays an important role in the performance of any iris recognition system. Nonlinear eye movement, occlusion, and specular reflection are the main challenges for any iris segmentation method. In thi...

Improving the dictionary lookup approach for disease normalization using enhanced dictionary and query expansion

The rapidly growing biomedical literature calls for an automatic approach to recognizing and normalizing disease mentions in order to increase the precision and effectiveness of disease-based information retrieval. A variety of methods have been proposed to deal with the problem of disease named entity recognition and normalization. Among all the proposed methods, conditio...

A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features

Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three kinds of methods have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and hybrids of the two. Machine-learning-based methods perform well on the Persian language if they are trained with good features. To get good performanc...

Journal:
  • Journal of biomedical informatics

Volume 37, Issue 6

Publication date: 2004